
Variable Selection Using Relative Importance Rankings

Chang, Tien-En, Chen, Argon

arXiv.org Machine Learning

Although conceptually related, variable selection and relative importance (RI) analysis have been treated quite differently in the literature. While RI is typically used for post-hoc model explanation, this paper explores its potential for variable ranking and filter-based selection before model creation. Specifically, we anticipate strong performance from the RI measures because they incorporate both direct and combined effects of predictors, addressing a key limitation of marginal correlation that ignores dependencies among predictors. We implement and evaluate the RI-based variable selection methods using general dominance (GD), comprehensive relative importance (CRI), and a newly proposed, computationally efficient variant termed CRI.Z. We first demonstrate how the RI measures more accurately rank the variables than the marginal correlation, especially when there are suppressed or weak predictors. We then show that predictive models built on these rankings are highly competitive, often outperforming state-of-the-art methods such as the lasso and relaxed lasso. The proposed RI-based methods are particularly effective in challenging cases involving clusters of highly correlated predictors, a setting known to cause failures in many benchmark methods. Although lasso methods have dominated the recent literature on variable selection, our study reveals that the RI-based method is a powerful and competitive alternative. We believe these underutilized tools deserve greater attention in the statistics and machine learning communities. The code is available at: https://github.com/tien-endotchang/RI-variable-selection.
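
To make the filter idea concrete, here is a minimal, hypothetical sketch of ranking predictors by general dominance (GD), i.e., each predictor's average marginal contribution to R^2 over all subsets of the remaining predictors. It enumerates all 2^p subsets and so only scales to small p; the paper's CRI.Z variant is motivated precisely by this computational burden, and the authors' actual implementation lives in the linked repository. All names below are ours.

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def subset_r2(X, y, subset):
    """R^2 of an OLS fit on the given subset of columns (0 if empty)."""
    if len(subset) == 0:
        return 0.0
    Xs = X[:, list(subset)]
    return LinearRegression().fit(Xs, y).score(Xs, y)

def general_dominance(X, y):
    """Shapley-style average marginal contribution of each predictor to R^2."""
    p = X.shape[1]
    scores = np.zeros(p)
    for j in range(p):
        rest = [k for k in range(p) if k != j]
        per_size = []
        for s in range(p):  # subset sizes 0 .. p-1
            contribs = [subset_r2(X, y, S + (j,)) - subset_r2(X, y, S)
                        for S in combinations(rest, s)]
            per_size.append(np.mean(contribs))
        scores[j] = np.mean(per_size)  # average over sizes, then overall
    return scores

# Filter-style use: rank variables by GD, keep the top k before modeling
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)
ranking = np.argsort(general_dominance(X, y))[::-1]
print("variables ranked by general dominance:", ranking)
```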


Minimax-Optimal Dimension-Reduced Clustering for High-Dimensional Nonspherical Mixtures

Huang, Chengzhu, Gu, Yuqi

arXiv.org Machine Learning

In mixture models, nonspherical (anisotropic) noise within each cluster is widely present in real-world data. We study both the minimax rate and optimal statistical procedure for clustering under high-dimensional nonspherical mixture models. In high-dimensional settings, we first establish the information-theoretic limits for clustering under Gaussian mixtures. The minimax lower bound unveils an intriguing informational dimension-reduction phenomenon: there exists a substantial gap between the minimax rate and the oracle clustering risk, with the former determined solely by the projected centers and projected covariance matrices in a low-dimensional space. Motivated by the lower bound, we propose a novel computationally efficient clustering method: Covariance Projected Spectral Clustering (COPO). Its key step is to project the high-dimensional data onto the low-dimensional space spanned by the cluster centers and then use the projected covariance matrices in this space to enhance clustering. We establish tight algorithmic upper bounds for COPO, both for Gaussian noise with flexible covariance and general noise with local dependence. Our theory indicates the minimax-optimality of COPO in the Gaussian case and highlights its adaptivity to a broad spectrum of dependent noise. Extensive simulation studies under various noise structures and real data analysis demonstrate our method's superior performance.
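
A loose sketch of the two-step idea, under our own simplifying assumptions: the center subspace is estimated with a plain SVD, and the projected-covariance step is replaced by a full-covariance Gaussian mixture in the projected space. The paper's exact COPO estimator and its guarantees are not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def copo_like_clustering(X, K, seed=0):
    Xc = X - X.mean(axis=0)                       # center the data
    # Top right singular vectors approximate the span of the cluster centers
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:K].T                             # project d dims down to K
    # A full-covariance mixture in the projected space exploits the projected
    # covariance matrices to sharpen the cluster boundaries
    gmm = GaussianMixture(n_components=K, covariance_type="full",
                          random_state=seed).fit(Z)
    return gmm.predict(Z)

# Toy check: two elongated (nonspherical) clusters in 50 dimensions
rng = np.random.default_rng(0)
centers = np.zeros((2, 50))
centers[0, 0], centers[1, 0] = -3.0, 3.0
X = np.vstack([c + rng.standard_normal((100, 50)) * np.linspace(0.5, 2.0, 50)
               for c in centers])
labels = copo_like_clustering(X, K=2)
```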


A Data-Driven Investigation of Noise-Adaptive Utterance Generation with Linguistic Modification

Chingacham, Anupama, Demberg, Vera, Klakow, Dietrich

arXiv.org Artificial Intelligence

In noisy environments, speech can be hard for humans to understand. Spoken dialog systems can help to enhance the intelligibility of their output, either by modifying the speech synthesis (e.g., imitating Lombard speech) or by optimizing the language generation. Here we focus on the second type of approach, in which an intended message is realized with words that are more intelligible in a specific noisy environment. By conducting a speech perception experiment, we created a dataset of 900 paraphrases in babble noise, perceived by native English speakers with normal hearing. We find that careful selection of paraphrases can improve intelligibility by 33% at SNR -5 dB. Our analysis of the data shows that the intelligibility differences between paraphrases are mainly driven by noise-robust acoustic cues. Furthermore, we propose an intelligibility-aware paraphrase ranking model, which outperforms baseline models with a relative improvement of 31.37% at SNR -5 dB.
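
For readers unfamiliar with the SNR figures quoted above, this is the standard mixing arithmetic behind a condition such as "babble noise at SNR -5 dB": the noise is rescaled so that the signal-to-noise power ratio, in decibels, hits the target. The sketch shows only that step, not the perception experiment or the ranking model; the function and variable names are ours.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db."""
    noise = noise[: len(speech)]                 # assume noise is long enough
    p_speech = np.mean(speech ** 2)              # average signal power
    p_noise = np.mean(noise ** 2)                # average noise power
    # Solve 10*log10(p_speech / (g**2 * p_noise)) = snr_db for the gain g
    g = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + g * noise

# At snr_db = -5, the scaled noise is 5 dB more powerful than the speech
rng = np.random.default_rng(0)
mixture = mix_at_snr(np.sin(np.linspace(0, 100, 16000)),
                     rng.standard_normal(16000), snr_db=-5)
```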


Subset selection for linear mixed models

Kowal, Daniel R.

arXiv.org Machine Learning

Linear mixed models (LMMs) are instrumental for regression analysis with structured dependence, such as grouped, clustered, or multilevel data. However, selection among the covariates--while accounting for this structured dependence--remains a challenge. We introduce a Bayesian decision analysis for subset selection with LMMs. Using a Mahalanobis loss function that incorporates the structured dependence, we derive optimal linear actions for any subset of covariates and under any Bayesian LMM. Crucially, these actions inherit shrinkage or regularization and uncertainty quantification from the underlying Bayesian LMM. Rather than selecting a single "best" subset, which is often unstable and limited in its information content, we collect the acceptable family of subsets that nearly match the predictive ability of the "best" subset. The acceptable family is summarized by its smallest member and key variable importance metrics. Customized subset search and out-of-sample approximation algorithms are provided for more scalable computing. These tools are applied to simulated data and a longitudinal physical activity dataset, and in both cases demonstrate excellent prediction, estimation, and selection ability.
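
As a hedged illustration of the decision-analysis step: suppose the fitted Bayesian LMM is summarized by posterior-mean fitted values y_hat and a marginal covariance matrix Sigma encoding the structured dependence. The optimal linear action for a covariate subset under Mahalanobis loss is then a generalized-least-squares projection. The acceptable-family search from the paper is not reproduced, and the setup below is our own simplification.

```python
import numpy as np

def optimal_linear_action(X, y_hat, Sigma, subset):
    """GLS coefficients minimizing (y_hat - X_S b)' Sigma^{-1} (y_hat - X_S b)."""
    Xs = X[:, subset]
    Sigma_inv = np.linalg.inv(Sigma)
    A = Xs.T @ Sigma_inv @ Xs
    return np.linalg.solve(A, Xs.T @ Sigma_inv @ y_hat)

def mahalanobis_loss(X, y_hat, Sigma, subset, b):
    """Loss of the action b for a subset; lower is better."""
    r = y_hat - X[:, subset] @ b
    return float(r @ np.linalg.solve(Sigma, r))

# Candidate subsets can then be compared by their loss relative to the "best"
# subset, which is the starting point for assembling an acceptable family.
```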


Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest

Zhou, Siyu, Mentch, Lucas

arXiv.org Machine Learning

Due to their long-standing reputation as excellent off-the-shelf predictors, random forests remain a go-to model of choice for applied statisticians and data scientists. Despite their widespread use, however, until recently little was known about their inner workings or about which aspects of the procedure drive their success. Very recently, two competing hypotheses have emerged -- one based on interpolation and the other based on regularization. This work argues in favor of the latter by utilizing the regularization framework to reexamine the decades-old question of whether individual trees in an ensemble ought to be pruned. Despite the fact that default constructions of random forests use near-full-depth trees in most popular software packages, here we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that random forests with shallow trees are advantageous when the signal-to-noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of "double descent" in random forests by drawing parallels to U-statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.
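
The practical takeaway admits a simple demonstration. The simulation below is our own, not the authors' experimental design: it compares full-depth and shallow random forests on synthetic regression data at a high and a low signal-to-noise ratio.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.standard_normal((n, p))
signal = X[:, 0] + X[:, 1]

for snr in [5.0, 0.5]:                      # high vs low signal-to-noise ratio
    noise_sd = np.sqrt(np.var(signal) / snr)
    y = signal + rng.normal(0, noise_sd, n)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    for depth in [None, 3]:                 # full-depth vs shallow trees
        rf = RandomForestRegressor(n_estimators=300, max_depth=depth,
                                   random_state=0).fit(Xtr, ytr)
        print(f"SNR={snr}, max_depth={depth}: R^2={rf.score(Xte, yte):.3f}")
```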


On the Benefits of Early Fusion in Multimodal Representation Learning

Barnum, George, Talukder, Sabera, Yue, Yisong

arXiv.org Artificial Intelligence

Intelligently reasoning about the world often requires integrating data from multiple modalities, as any individual modality may contain unreliable or incomplete information; in many cases, a single modality does not contain sufficient information to classify a scene on its own. Conventional multimodal networks, however, typically fuse their input streams only after substantial independent processing, whereas the brain performs multimodal processing almost immediately. This divide between conventional multimodal learning and neuroscience suggests that a detailed study of early multimodal fusion could improve artificial multimodal representations. To facilitate the study of early multimodal fusion, we create a convolutional LSTM network architecture that simultaneously processes both audio and visual inputs, and allows us to select the layer at which audio and visual information combines. Our results demonstrate that immediate fusion of audio and visual inputs in the initial C-LSTM layer results in higher-performing networks that are more robust to the addition of white noise in both audio and visual inputs.
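
A minimal PyTorch sketch of the early-fusion idea, with the paper's C-LSTM simplified to a plain convolutional encoder and all layer sizes illustrative: audio and visual inputs are concatenated along the channel axis before any unimodal processing.

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # Audio (a 1-channel spectrogram resized to the frame resolution) and
        # video (a 3-channel frame) are stacked along the channel axis before
        # any modality-specific processing takes place.
        self.encoder = nn.Sequential(
            nn.Conv2d(1 + 3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, audio, video):
        x = torch.cat([audio, video], dim=1)    # fusion at the input layer
        return self.head(self.encoder(x).flatten(1))

# audio: (batch, 1, H, W) spectrogram; video: (batch, 3, H, W) frame
logits = EarlyFusionNet()(torch.randn(2, 1, 64, 64), torch.randn(2, 3, 64, 64))
```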


Getting Better from Worse: Augmented Bagging and a Cautionary Tale of Variable Importance

Mentch, Lucas, Zhou, Siyu

arXiv.org Machine Learning

As the size, complexity, and availability of data continue to grow, scientists are increasingly relying upon black-box learning algorithms that can often provide accurate predictions with minimal a priori model specification. Tools like random forests have an established track record of off-the-shelf success and even offer various strategies for analyzing the underlying relationships between features and the response. Motivated by recent insights into random forest behavior, here we introduce the idea of augmented bagging (AugBagg), a procedure that operates in an identical fashion to its classical bagging and random forest counterparts but on a larger space containing additional, randomly generated features. Somewhat surprisingly, we demonstrate that the simple act of adding extra random features into the model can have a dramatic beneficial effect on performance, sometimes outperforming even an optimally tuned traditional random forest. This finding that the inclusion of an additional set of features generated independently of the response can considerably improve predictive performance has crucial implications for the manner in which we consider and measure variable importance. Numerous demonstrations on both real and synthetic data are provided.
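
A hedged sketch of the AugBagg recipe as described in the abstract: append randomly generated, response-independent features and fit a standard random forest on the augmented matrix. The number of noise features and the choice of re-drawing noise columns at prediction time are our own illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

class AugBagg:
    """Random forest fit on [X | noise], with noise columns independent of y."""

    def __init__(self, n_noise=20, seed=0, **rf_kwargs):
        self.n_noise = n_noise
        self.rng = np.random.default_rng(seed)
        self.rf = RandomForestRegressor(random_state=seed, **rf_kwargs)

    def _augment(self, X):
        noise = self.rng.standard_normal((X.shape[0], self.n_noise))
        return np.hstack([X, noise])

    def fit(self, X, y):
        self.rf.fit(self._augment(X), y)
        return self

    def predict(self, X):
        # Test points also need values for the synthetic columns; fresh draws
        # are one simple choice (an assumption, not necessarily the authors'
        # exact protocol).
        return self.rf.predict(self._augment(X))
```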


Higher-Order Block Term Decomposition for Spatially Folded fMRI Data

Chatzichristos, Christos, Kofidis, Eleftherios, Kopsinis, Giannis, Theodoridis, Sergios

arXiv.org Machine Learning

Functional Magnetic Resonance Imaging (fMRI) is a noninvasive technique for studying brain activity that has received increasing attention over the last decade or so. During an fMRI experiment, a series of brain images is acquired while the subject possibly performs a set of tasks in response to external stimuli. Changes in the measured blood-oxygen-level-dependent (BOLD) signal are used to examine different types of activation in the brain. There are several objectives in the analysis of fMRI data, the most common of which are the localization of the brain regions that are activated by a task and the determination of functional brain connectivity [1, 2]. The localization of the activated areas in the human brain is a challenging "cocktail party" problem, where several people (the activated areas) are talking simultaneously behind a wall (the skull). Our goal is to distinguish those areas (spatial maps), as well as their activation patterns (time courses), through some blind source separation (decomposition) method [3, 4]. Each source is the outcome of a combination of a time course with a spatial map. In fMRI studies of brain function, the data have a structure involving multiple modes, such as trial, task condition, and subject, in addition to the intrinsic dimensions of time and space [5].
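
The paper's higher-order block term decomposition (BTD) is not part of common tensor libraries, so as a simplified stand-in the sketch below uses the rank-1 special case, CP/PARAFAC via tensorly, which already realizes the "source = spatial map x time course" structure; BTD generalizes the spatial part to higher multilinear rank. The tensor shapes here are purely illustrative.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Toy fMRI-like tensor: voxels x time points x subjects (shapes illustrative)
T = tl.tensor(np.random.default_rng(0).standard_normal((100, 50, 8)))

cp = parafac(T, rank=3)                       # 3 sources, each rank-1
spatial_maps, time_courses, subject_loadings = cp.factors
print(spatial_maps.shape, time_courses.shape, subject_loadings.shape)
# -> (100, 3) spatial maps, (50, 3) time courses, (8, 3) subject loadings
```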